Skip to content

Add new DLM Frozen Tier Transition execution plugin and service#144595

Merged
lukewhiting merged 17 commits intoelastic:mainfrom
lukewhiting:est-2433-frozen-task-executor-service
Mar 26, 2026
Merged

Add new DLM Frozen Tier Transition execution plugin and service#144595
lukewhiting merged 17 commits intoelastic:mainfrom
lukewhiting:est-2433-frozen-task-executor-service

Conversation

@lukewhiting
Copy link
Copy Markdown
Contributor

@lukewhiting lukewhiting commented Mar 19, 2026

This PR adds a new X-Pack plugin for the DLM frozen tier and creates an execution framework for that transition.

Essentially there's a scheduler thread that checks for new indices with the "To transition" metadata maker every X minutes.

If it finds any, it checks if that index is already being transitioned and that there's available threads to execute the transition. If there is, it starts the transition on a thread in it's managed, fixed size thread pool. If not, the index is skipped if executing or all remaining indices are skipped if the thread pool is at capacity.

All this is gated behind a feature flag.

Also depends on #144511 to make Datas Streams plugin extensible / module exposed.

Fixes elastic/elasticsearch-team#2433

@lukewhiting lukewhiting requested review from a team as code owners March 19, 2026 17:04
@elasticsearchmachine elasticsearchmachine added v9.4.0 needs:triage Requires assignment of a team area label labels Mar 19, 2026
@lukewhiting lukewhiting added >non-issue :StorageEngine/Data streams Data streams and their lifecycles labels Mar 19, 2026
@elasticsearchmachine elasticsearchmachine added Team:StorageEngine and removed needs:triage Requires assignment of a team area label labels Mar 19, 2026
@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-storage-engine (Team:StorageEngine)

@lukewhiting lukewhiting changed the title Add new DLM Frozen Tier Transition executioplugin and service Add new DLM Frozen Tier Transition execution plugin and service Mar 19, 2026
@elastic elastic deleted a comment from coderabbitai bot Mar 19, 2026
@elastic elastic deleted a comment from coderabbitai bot Mar 19, 2026
@lukewhiting
Copy link
Copy Markdown
Contributor Author

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Mar 19, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Mar 19, 2026

📝 Walkthrough

Walkthrough

This change introduces a new x-pack plugin (dlm-frozen-transition) for managing concurrent data stream lifecycle frozen index transitions. It includes a capacity-limited executor (DlmFrozenTransitionExecutor) that tracks per-index transitions, a master-only service (DlmFrozenTransitionService) that polls cluster metadata and submits transition tasks when capacity allows, and supporting infrastructure including a new interface contract and plugin entry point. The EsExecutors class is also updated to treat queueCapacity == 0 as a synchronous handoff queue. Test suites validate executor concurrency semantics and service cluster-state integration.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • 🛠️ Update Documentation: Commit on current branch
  • 🛠️ Update Documentation: Create PR
📝 Coding Plan
  • Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Tip

CodeRabbit can use OpenGrep to find security vulnerabilities and bugs across 17+ programming languages.

OpenGrep is compatible with Semgrep configurations. Add an opengrep.yml or semgrep.yml configuration file to your project to enable OpenGrep analysis.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@x-pack/plugin/dlm-frozen-transition/src/main/java/org/elasticsearch/xpack/dlm/frozen/DlmFrozenTransitionExecutor.java`:
- Around line 79-81: The catch block in DlmFrozenTransitionExecutor (the
try/catch where logger.error(() -> Strings.format("Error executing transition
for index [%s]", indexName), t) is called) currently catches Throwable which
swallows Error subclasses; change it to catch Exception (e.g., catch (Exception
e)) so only recoverable exceptions are handled and logged via logger.error, and
allow Errors to propagate after any finally cleanup has run; keep the existing
finally/cleanup logic untouched so it executes before propagation.

In
`@x-pack/plugin/dlm-frozen-transition/src/test/java/org/elasticsearch/xpack/dlm/frozen/DlmFrozenTransitionServiceTests.java`:
- Around line 93-97: The tearDown method currently calls super.tearDown() before
closing local resources which can leak; modify the
DlmFrozenTransitionServiceTests.tearDown method so that clusterService.close()
and terminate(threadPool) are executed first, then call super.tearDown() as the
last step to ensure local cleanup (clusterService and threadPool) happens before
superclass teardown.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 826ca642-bb51-4e6e-bacb-32e50d28f6e9

📥 Commits

Reviewing files that changed from the base of the PR and between 9a04fdd and 0d1c2d8.

📒 Files selected for processing (11)
  • server/src/main/java/org/elasticsearch/common/util/concurrent/EsExecutors.java
  • x-pack/plugin/dlm-frozen-transition/build.gradle
  • x-pack/plugin/dlm-frozen-transition/src/main/java/org/elasticsearch/xpack/dlm/frozen/DataStreamLifecycleConvertToFrozen.java
  • x-pack/plugin/dlm-frozen-transition/src/main/java/org/elasticsearch/xpack/dlm/frozen/DlmFrozenTransitionExecutor.java
  • x-pack/plugin/dlm-frozen-transition/src/main/java/org/elasticsearch/xpack/dlm/frozen/DlmFrozenTransitionPlugin.java
  • x-pack/plugin/dlm-frozen-transition/src/main/java/org/elasticsearch/xpack/dlm/frozen/DlmFrozenTransitionRunnable.java
  • x-pack/plugin/dlm-frozen-transition/src/main/java/org/elasticsearch/xpack/dlm/frozen/DlmFrozenTransitionService.java
  • x-pack/plugin/dlm-frozen-transition/src/main/plugin-metadata/entitlement-policy.yaml
  • x-pack/plugin/dlm-frozen-transition/src/test/java/org/elasticsearch/xpack/dlm/frozen/DataStreamLifecycleConvertToFrozenMarkReadOnlyTests.java
  • x-pack/plugin/dlm-frozen-transition/src/test/java/org/elasticsearch/xpack/dlm/frozen/DlmFrozenTransitionExecutorTests.java
  • x-pack/plugin/dlm-frozen-transition/src/test/java/org/elasticsearch/xpack/dlm/frozen/DlmFrozenTransitionServiceTests.java

…earch/xpack/dlm/frozen/DlmFrozenTransitionServiceTests.java

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Copy link
Copy Markdown
Member

@dakrone dakrone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this Luke, I left some comments.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

import static org.apache.logging.log4j.LogManager.getLogger;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be using org.elasticsearch.logging.LogManager, not log4j's

Copy link
Copy Markdown
Contributor Author

@lukewhiting lukewhiting Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switched.

Out of curiosity, what's the history behind this? It seems like a pretty direct wrapper around slf4j and is used pretty sporadically through the codebase vs actual slf4j...

thread.setDaemon(true);
return thread;
});
schedulerThreadExecutor.scheduleAtFixedRate(this::checkForFrozenIndices, 0, pollInterval.millis(), TimeUnit.MILLISECONDS);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remember there being a reason why we preferred not to use scheduleAtFixedRate, but unfortunately I can't remember the reason. Is there a reason you didn't go with the SchedulerEngine the way that the existing DLM service (and many of our services) does? I.e.,

private void maybeScheduleJob() {
if (this.isMaster == false) {
return;
}
// don't schedule the job if the node is shutting down
if (isClusterServiceStoppedOrClosed()) {
logger.trace(
"Skipping scheduling a data stream lifecycle job due to the cluster lifecycle state being: [{}] ",
clusterService.lifecycleState()
);
return;
}
if (scheduler.get() == null) {
scheduler.set(new SchedulerEngine(settings, clock));
scheduler.get().register(this);
}
assert scheduler.get() != null : "scheduler should be available";
scheduledJob = new SchedulerEngine.Job(LIFECYCLE_JOB_NAME, new TimeValueSchedule(pollInterval));
scheduler.get().add(scheduledJob);
}

Then you have the other bits abstracted away instead of dealing with the executor directly. (And logging in case anything goes wrong)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seemed a bit overkill for just triggering a single method on a fixed schedule but it's something I could be persuaded on if you think there are advantages?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One nice thing about the SchedulerEngine is that its invocations are rescheduled only after work has done, rather than having to deal with overlapping triggers. It also automatically uses the EsExecutors.daemonThreadFactory instead of having to set it on the thread itself here. It also has a more robust .stop() method that does shutdown, waits, can timeout, interrupts, etc without having to do it yourself.

Overall I think it's nice to abstract that bit away into the SchedulerEngine class more than manage size-1 threadpools in places.

Copy link
Copy Markdown
Contributor Author

@lukewhiting lukewhiting Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One nice thing about the SchedulerEngine is that its invocations are rescheduled only after work has done, rather than having to deal with overlapping triggers.

I believe that's already the case with scheduleAtFixedRate. According to the docs:

If any execution of this task takes longer than its period, then subsequent executions may start late, but will not concurrently execute.

But I think the shutdown management stuff is worth it. I'll get that switched over.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah change of plan... I implemented this and it's made the class more complex 😅

It didn't simplify the shutdown / startup (actually made it 2 lines longer) plus it adds the requirement to extend an interface, register the class with the scheduler and the tests are more complex as it does not fire immediately on registration.

I really can't see an advantage of SchedulerEnginer over a modern ScheduledExecutorService... I'm guessing it's a holdover from before that existed? Perhaps I'm missing something else ES specific?

I'm going to stick with this for now. Long term, I think it will be easier for devs to maintain a standard JVM feature than something custom.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I have done is made use of our ThreadPools.terminate(...) on the close() methods. This will made the case of a node being asked to shutdown a bit more graceful will still using the force interrupt shutdownNow() for when the node is de-elected as master.

@lukewhiting

This comment was marked as resolved.

Copy link
Copy Markdown
Contributor

@seanzatzdev seanzatzdev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link
Copy Markdown
Member

@dakrone dakrone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some more comments, I realize a lot of this stuff is personal preference and subjective, but hopefully I can be convincing in some cases anyway :)

private final AtomicBoolean isMaster = new AtomicBoolean(false);
private volatile ScheduledExecutorService schedulerThreadExecutor;
private volatile DlmFrozenTransitionExecutor transitionExecutor;
private volatile boolean closing = false;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My personal preference is to call this "running" to avoid having to read it as "if not closing" below, but it is only a personal preference, no need to change if you prefer it as "closing"

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems I'm in a bit of a contrary mood on this PR but I'm going to disagree again 😅

We check both closing == true and closing == false in this class plus closing in a thing that is happening vs closed which is a state and we are tracking the former so I think it fits better?

Again, I'm not married to either, more laying out my logic.

thread.setDaemon(true);
return thread;
});
schedulerThreadExecutor.scheduleAtFixedRate(this::checkForFrozenIndices, 0, pollInterval.millis(), TimeUnit.MILLISECONDS);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One nice thing about the SchedulerEngine is that its invocations are rescheduled only after work has done, rather than having to deal with overlapping triggers. It also automatically uses the EsExecutors.daemonThreadFactory instead of having to set it on the thread itself here. It also has a more robust .stop() method that does shutdown, waits, can timeout, interrupts, etc without having to do it yourself.

Overall I think it's nice to abstract that bit away into the SchedulerEngine class more than manage size-1 threadpools in places.

@lukewhiting lukewhiting force-pushed the est-2433-frozen-task-executor-service branch from bdee0af to aae1e77 Compare March 24, 2026 11:51
@lukewhiting lukewhiting requested a review from dakrone March 24, 2026 16:16
Copy link
Copy Markdown
Member

@dakrone dakrone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, I left a few more comments but nothing that would require a re-review, thanks for the iteration on this!

@lukewhiting lukewhiting enabled auto-merge (squash) March 25, 2026 14:02
@lukewhiting lukewhiting disabled auto-merge March 25, 2026 14:51
…earch/xpack/dlm/frozen/DlmFrozenTransitionService.java

Co-authored-by: Lee Hinman <dakrone@users.noreply.github.com>
@lukewhiting lukewhiting enabled auto-merge (squash) March 25, 2026 14:52
@lukewhiting lukewhiting merged commit 07bc331 into elastic:main Mar 26, 2026
36 checks passed
@lukewhiting lukewhiting deleted the est-2433-frozen-task-executor-service branch March 26, 2026 10:30
szybia added a commit to szybia/elasticsearch that referenced this pull request Mar 26, 2026
* upstream/main: (146 commits)
  Revert "[Native] Gradle-related tweaks to improve handling of the simdvec native library (elastic#144539)"
  Fix ArrayIndexOutOfBoundsException in fetch phase with partial results (elastic#144385)
  ESQL: Correctly manage NULL data type for SUM (elastic#144942)
  [ESQL] Fixes GroupedTopNBenchmark not executing (elastic#144944)
  Fix reader context leak when query response serialization fails (elastic#144708)
  Validate individual offset values in BULK_OFFSETS bounds checks (elastic#144643)
  Merge main21 source set into main in simdvec (elastic#144921)
  [TEST] Unmute TsidExtractingIdFieldMapperTests (elastic#144848)
  [Native] Gradle-related tweaks to improve handling of the simdvec native library (elastic#144539)
  Fix `ThreadedActionListenerTests#testRejectionHandling` (elastic#144795)
  Add new DLM Frozen Tier Transition execution plugin and service (elastic#144595)
  Prometheus: execute query_range via parsed EsqlStatement plan (elastic#144416)
  Investigate `testBulkIndexingRequestSplitting` failure (elastic#144766)
  Add test utility for wrapping directories in FilterDirectory layer (elastic#143563)
  Fix ES|QL decay tests with negative scale (elastic#144657)
  Fix circuit breaker leak in percolator query construction (elastic#144827)
  Use XPerFieldDocValuesFormat in AbstractTSDBSyntheticIdCodec (elastic#144744)
  [DOCS] Document how reindex work in CPS (elastic#144016)
  Fix Int4 vector library tests failing on Java 21 (elastic#144830)
  [DiskBBQ] Fix index sorting on flush (elastic#144938)
  ...
seanzatzdev pushed a commit to seanzatzdev/elasticsearch that referenced this pull request Mar 26, 2026
…tic#144595)

* Add new DLM Frozen Tier Transition executioplugin and service

* Revert inspection profile change

* Update x-pack/plugin/dlm-frozen-transition/src/test/java/org/elasticsearch/xpack/dlm/frozen/DlmFrozenTransitionServiceTests.java

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* PR changes

* Fix flaky test

* Move new tests to correct plugin

* Fix plugin errors on startup

* Improve thread pool termination and ensure proper clean-up during close

* PR Changes

* Update x-pack/plugin/dlm-frozen-transition/src/main/java/org/elasticsearch/xpack/dlm/frozen/DlmFrozenTransitionService.java

Co-authored-by: Lee Hinman <dakrone@users.noreply.github.com>

---------

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: Lee Hinman <dakrone@users.noreply.github.com>
seanzatzdev pushed a commit to seanzatzdev/elasticsearch that referenced this pull request Mar 27, 2026
…tic#144595)

* Add new DLM Frozen Tier Transition executioplugin and service

* Revert inspection profile change

* Update x-pack/plugin/dlm-frozen-transition/src/test/java/org/elasticsearch/xpack/dlm/frozen/DlmFrozenTransitionServiceTests.java

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* PR changes

* Fix flaky test

* Move new tests to correct plugin

* Fix plugin errors on startup

* Improve thread pool termination and ensure proper clean-up during close

* PR Changes

* Update x-pack/plugin/dlm-frozen-transition/src/main/java/org/elasticsearch/xpack/dlm/frozen/DlmFrozenTransitionService.java

Co-authored-by: Lee Hinman <dakrone@users.noreply.github.com>

---------

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: Lee Hinman <dakrone@users.noreply.github.com>
# Conflicts:
#	x-pack/plugin/dlm-frozen-transition/src/main/java/org/elasticsearch/xpack/dlm/frozen/DataStreamLifecycleConvertToFrozen.java
mamazzol pushed a commit to mamazzol/elasticsearch that referenced this pull request Mar 30, 2026
…tic#144595)

* Add new DLM Frozen Tier Transition executioplugin and service

* Revert inspection profile change

* Update x-pack/plugin/dlm-frozen-transition/src/test/java/org/elasticsearch/xpack/dlm/frozen/DlmFrozenTransitionServiceTests.java

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* PR changes

* Fix flaky test

* Move new tests to correct plugin

* Fix plugin errors on startup

* Improve thread pool termination and ensure proper clean-up during close

* PR Changes

* Update x-pack/plugin/dlm-frozen-transition/src/main/java/org/elasticsearch/xpack/dlm/frozen/DlmFrozenTransitionService.java

Co-authored-by: Lee Hinman <dakrone@users.noreply.github.com>

---------

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: Lee Hinman <dakrone@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants